ABSTRACT

The Resource Description Framework (RDF) is a W3C standard
for representing graph-structured data, and SPARQL is the standard
query language for RDF. Recent advances in Information Extraction,
Linked Data Management and the Semantic Web have led to
a rapid increase in both the volume and the variety of RDF data
that are publicly available. As businesses start to capitalize on
RDF data, RDF data management systems are being exposed to
workloads that are far more diverse and dynamic than what they
were designed to handle. Consequently, there is a growing need
for developing workload-adaptive and self-tuning RDF data
management systems. To realize this vision, we introduce a fast and
efficient method for dynamically clustering records in an RDF data
management system. Specifically, we assume nothing about the
workload upfront, but as SPARQL queries are executed, we keep
track of records that are co-accessed by the queries in the workload
and physically cluster them. To decide dynamically (hence,
in constant-time) where a record needs to be placed in the storage
system, we develop a new locality-sensitive hashing (LSH)
scheme, Tunable-LSH. Using Tunable-LSH, records that are
co-accessed across similar sets of queries can be hashed to the same
or nearby physical pages in the storage system. What sets
Tunable-LSH apart from existing LSH schemes is that it can auto-tune
to achieve the aforementioned clustering objective with high
accuracy even when the workloads change. Experimental evaluation of
Tunable-LSH in our prototype RDF data management system,
chameleon-db, as well as in a standalone hashtable shows
significant end-to-end improvements over existing solutions.


1. INTRODUCTION 

Physical data organization plays an important role in the
performance tuning of database management systems. A particularly
important problem is clustering (in the storage system) records that
are frequently co-accessed by queries in a workload. Suboptimal
clustering has negative performance implications due to random
I/O and cache stalls. This problem has received attention in the
context of SQL databases and has led to the introduction of tuning
advisors that work either in an offline or online fashion (i.e.,
self-tuning databases) [24].

Figure 1: Adaptive record placement using a combination of
adaptive hashing and caching.

In this paper, we address the problem in the context of RDF data
management systems. SPARQL workloads are far more dynamic
than SQL workloads [13, 43]; yet, tuning techniques for RDF data
management systems are in their infancy, and relational solutions
are not directly applicable. More specifically, depending on the
workload, it might be necessary to completely change the underlying
physical representation in an RDF data management system,
such as by dynamically switching from a row-oriented representation
to a columnar representation. On the other hand, existing
online tuning techniques work well only when the schema changes
are minor [20]. Consequently, with the increasing demand to support
highly dynamic workloads in RDF [13, 43], there is a growing
need to develop more adaptive tuning solutions, in which records
in an RDF database can be dynamically and continuously clustered
based on the current workload.

Whenever a SPARQL query is executed, there is an opportunity
to observe how records in an RDF database are being utilized. This
information about query access patterns can be used to dynamically
cluster records in the storage system. Dynamism is important
in RDF systems because of the high variability and dynamism in
SPARQL workloads [13, 43]. While this problem has been studied
as physical clustering [47] and distribution design [22], the highly
dynamic nature of the queries over RDF data introduces new
challenges. First, traditional algorithms are offline, and since clustering
is an NP-hard problem and most approximations have quadratic
complexity [42], they are not suitable for online database
clustering. Instead, techniques are needed with similar clustering
objectives, but that have constant running time. Second, systems are
typically expected to execute most queries in subseconds [50],
leaving only fractions of a second to update their physical data
structures (i.e., in our case, we are concerned with dynamically moving
records across the storage system).

We address the aforementioned issues by making two contributions.
First, as shown in Fig. 1, instead of clustering the whole
database, we cluster only the "warm" portions of the database by
relying on the admission policy of the existing database cache.
Second, we develop a self-tuning locality-sensitive hash (LSH)
function, namely, Tunable-LSH, to decide in constant-time where in
the storage system to place a record. Tunable-LSH has two
important properties:

• It tries to ensure that (i) records with similar utilization
patterns (i.e., those records that are co-accessed across similar
sets of queries) are mapped as much as possible to the
same pages in the storage system, while (ii) minimizing the
number of records with dissimilar utilization patterns that are
falsely mapped to the same page.

• Unlike conventional LSH [30, 40], Tunable-LSH can auto-tune
so as to achieve the aforementioned clustering objectives
with high accuracy even when the workloads change.

These ideas are illustrated in Fig. 1. Let us assume that initially,
the records in a database are not clustered according to any
particular workload. Therefore, the performance of the system is
suboptimal. However, every time records are fetched from the storage
system, there is an opportunity to bring together into a single page
those records that are co-accessed but are fragmented across the
storage system. Tunable-LSH achieves this with minimal overhead.
Furthermore, Tunable-LSH is continuously updated to reflect
any changes in the workload characteristics. Consequently, as
more queries are executed, records in the database become more
clustered, and the performance of the system improves.

The paper is organized as follows: Section 2 discusses related
work. Section 3 gives a conceptual description of the problem.
Section 4 describes the overview of our approach while Section 5
provides the details. In Section 6 we describe how physical
clustering takes place in the database, in particular, how Tunable-LSH
can be used in an RDF data management system, and we
experimentally evaluate our techniques. Finally, we discuss conclusions
and future work in Section 7.

2. RELATED WORK 

Locality-sensitive hashing (LSH) [30, 40] has been used in various
contexts such as nearest neighbour search [12, 14, 28, 36, 40, 56],
Web document clustering [18, 19] and query plan caching [7]. In
this paper, we use LSH in the physical design of RDF databases.
While multiple families of LSH functions have been developed
[19, 23, 25, 40], these functions assume that the input distribution is
either uniform or static. In contrast, Tunable-LSH can continuously
adapt to changes in the input distribution to achieve higher
accuracy, which translates to adapting to changes in the query
access patterns in the workloads in the context of RDF databases.

Physical design has been the topic of an ongoing discussion in
the world of RDF and SPARQL [55]. One option is to represent
data in a single large table [21] and build clustered indexes,
where each index implements a different sort order [34, 52, 57].
It has also been argued that grouping data can help improve
performance. For this reason, multiple physical representations
have been developed: in the group-by-predicate representation,
the database is vertically partitioned and the tables are stored
in a column-store; in the group-by-entity representation, implicit
relationships within the database are discovered (either manually
or automatically [17]), and the RDF data are mapped
to a relational database; and in the group-by-vertex representation,
RDF's inherent graph-structure is preserved, whereby data can be
grouped on the vertices in the graph [60]. These workload-oblivious
representations have issues for different types of queries, due to
reasons such as fragmented data, unnecessarily large intermediate
result tuples generated during query evaluation and/or suboptimal
pruning by the indexes.

To address some of these issues, workload-aware techniques have
been proposed [31, 35]. For example, view materialization
techniques have been implemented for RDF over relational engines.
However, these materialized views are difficult to adapt to changing
workloads for reasons discussed in Section 1. Workload-aware
distribution techniques have also been developed for RDF [35] and
implemented in systems such as WARP [35] and Partout [29], but
these systems are not runtime-adaptive. With Tunable-LSH, we
aim to address the problem adaptively, by clustering fragmented
records in the database based on the workload.


While there are self-tuning SQL databases [39] and techniques
for automatic schema design in SQL [15, 47, 59], these
techniques are not directly applicable to RDF. In RDF, the advised
changes to the underlying physical schema can be drastic,
for example, requiring the system to switch from a row-oriented
representation to a columnar one, all at runtime, which is hard
to achieve using existing techniques. Consequently, there have
been efforts in designing workload-adaptive and self-tuning RDF
data management systems. In H2RDF [53], the
choice between centralized versus distributed execution is made
adaptively. In PHDStore, data are adaptively replicated and
distributed across the compute nodes; however, the underlying physical
layout is fixed within each node. A mechanism for adaptively
caching partial results is introduced in [54]. With Tunable-LSH,
we are trying to address the adaptive record layout problem;
therefore, we believe that Tunable-LSH will complement existing
techniques and facilitate the development of runtime adaptive RDF
systems.

3. PRELIMINARIES 

Given a sequence of database records that represent the records'
serialization order in the storage system, the access patterns of a
query can conceptually be represented as a bit vector, where a bit is
set to 1 if the corresponding record in the sequence is accessed by
the query. We call this bit vector a query access vector ($\vec{q}$).

Depending on the system, a record may denote a single RDF
triple (i.e., the atomic unit of information in RDF), as in systems
like RDF-3x [51], or a collection of RDF triples such as in
chameleon-db. Our conceptual model is applicable either way.

As more queries are executed, their query access vectors can be
accumulated column-by-column in a matrix, as shown in Fig. 2a.
We call this matrix a query access matrix. For presentation, let
us assume that queries are numbered according to their order of
execution by the RDF data management system.

Each row of the query access matrix constitutes what we call a
record utilization vector ($\vec{r}$), which represents the set of queries
that access record r. As a convention, to distinguish between a query
and its access vector (likewise, a record and its utilization vector),
we use the symbols $q$ and $\vec{q}$ (likewise, $r$ and $\vec{r}$),
respectively. The complete list of symbols is given in Table 1.
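The relationship between query access vectors (columns) and record utilization vectors (rows) can be sketched in a few lines of Python. This is only an illustration: the function name `build_query_access_matrix` and the three queries over four records are hypothetical and are not taken from Fig. 2.

```python
from typing import List


def build_query_access_matrix(access_vectors: List[List[int]]) -> List[List[int]]:
    """Stack query access vectors as columns of a matrix.

    Row i of the result is the record utilization vector of record i:
    bit j is 1 if query j accessed record i.
    """
    if not access_vectors:
        return []
    num_records = len(access_vectors[0])
    return [[q[i] for q in access_vectors] for i in range(num_records)]


# Hypothetical workload: three queries over four records.
q0 = [1, 0, 1, 0]
q1 = [1, 1, 0, 0]
q2 = [0, 0, 1, 1]

matrix = build_query_access_matrix([q0, q1, q2])
r0 = matrix[0]  # record 0 was accessed by q0 and q1, but not by q2
```

Reading the matrix by column recovers the access pattern of one query, while reading it by row recovers the utilization of one record across the workload.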

To model the memory hierarchy, we use an additional notation
in the matrix representation: records that are physically stored
together on the same disk/memory page should be grouped together
in the query access matrix. For example, Fig. 2a and Fig. 2b
represent two alternative ways in which the records in an RDF database
can be clustered (groups are separated by horizontal dashed lines).
Even though both figures depict essentially the same query access
patterns, the physical organization in Fig. 2b is preferable, because
in Fig. 2a most queries require access to 4 pages each, whereas in
Fig. 2b the number of accesses is reduced by almost half.

Figure 2: Matrix representation of query access patterns.
(a) Representation at t = 8. (b) Clustered on rows.
(c) Clustered on rows and columns. (d) Grouping of bits.
(e) Alternative grouping.

Symbol                            Description
$\omega$                          database size (i.e., number of records)
$\epsilon$                        number of pages in the storage system
$k$                               maximum no. of query access vectors that can be stored
$b$                               number of entries in each record utilization counter
$t$                               current time
$\vec{q}$                         query access vector (contains $\omega$ bits)
$\vec{r}$                         record utilization vector (contains $k$ bits)
$\vec{c}$                         record utilization counter (contains $b$ entries)
$P$                               depending on the context, a point in a $k$-dimensional
                                  or $b$-dimensional (Taxicab) space
$M_{\omega \times k}$             query access matrix; contains the last $k$ most
                                  representative query access vectors (in columns),
                                  or equivalently, $\omega$ record utilization vectors (in rows)
$C_{\omega \times b}$             frequency matrix; represents record utilization frequency
                                  over $b$ groups of query access vectors
$q[i]$                            value of the $i$-th bit in query access vector $\vec{q}$
$r[i]$                            value of the $i$-th bit in record utilization vector $\vec{r}$
$c[i]$                            value of the $i$-th entry in record utilization counter $\vec{c}$
$P[i]$                            value of the $i$-th coordinate in point $P$
$M[i][j]$                         value of the $i$-th row and $j$-th column in matrix $M$
$C[i][j]$                         value of the $i$-th row and $j$-th column in matrix $C$
$\delta(\vec{r}_x, \vec{r}_y)$    Hamming distance between two record utilization vectors
$(\vec{q}_x : \vec{q}_y)$         MIN-HASH distance between two query access vectors
$\delta^M(P_x, P_y)$              Manhattan distance between two points
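To make the effect of record placement on page accesses concrete, the following Python sketch counts how many distinct pages each query touches under two layouts. The workload, the layouts and the helper names (`pages_touched`, `total_page_accesses`) are hypothetical and chosen only for illustration; they do not reproduce the matrices of Fig. 2.

```python
from typing import List, Sequence


def pages_touched(query_vector: Sequence[int], page_of_record: Sequence[int]) -> int:
    """Number of distinct pages that hold at least one record accessed by the query."""
    return len({page_of_record[i] for i, bit in enumerate(query_vector) if bit})


def total_page_accesses(queries: List[Sequence[int]], page_of_record: Sequence[int]) -> int:
    """Sum of page accesses over a whole workload."""
    return sum(pages_touched(q, page_of_record) for q in queries)


# Hypothetical workload over four records stored on two pages.
queries = [
    [1, 0, 1, 0],  # accesses records 0 and 2
    [1, 0, 1, 0],  # accesses records 0 and 2
    [0, 1, 0, 1],  # accesses records 1 and 3
]

layout_a = [0, 0, 1, 1]  # records 0,1 on page 0; records 2,3 on page 1
layout_b = [0, 1, 0, 1]  # co-accessed records share a page

cost_a = total_page_accesses(queries, layout_a)
cost_b = total_page_accesses(queries, layout_b)
```

In this toy setting the second layout halves the total number of page accesses, which is the kind of improvement the clustering objective aims for.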

Given a sequence of queries and the number of pages in the storage
system, our objective is to store records with similar utilization
vectors together so as to minimize the total number of page
accesses. To determine the similarity between record utilization
vectors, we rely on the following property. Two records are
co-accessed by a query if both of the corresponding bits in that query's
access vector are set to 1. Extending this concept to a set of queries,
we say that two records are co-accessed across multiple queries if
the corresponding bits in the record utilization vectors are set to
1 for all the queries in the set. For example, according to Fig. 
records ri and rs are co-accessed by queries go and q^, and records 
ro and re are co-accessed across the queries qi-qz. 

Given a sequence of queries, it may often be the case that a pair
of records are not co-accessed in all of the queries. Therefore, to
measure the extent to which a pair of records are co-accessed, we
rely on their Hamming distance [32]. Specifically, given two record
utilization vectors for the same sequence of queries, their Hamming
distance, denoted as $\delta(\vec{r}_x, \vec{r}_y)$, is defined as the
minimum number of substitutions necessary to make the two bit vectors
the same [32].¹ Hence, the smaller the Hamming distance between a
pair of records, the greater the extent to which they are co-accessed.

Consider the record utilization vectors $\vec{r}_0$, $\vec{r}_2$,
$\vec{r}_5$ and $\vec{r}_6$ across the query sequence $q_0$–$q_7$ in
Fig. 2a. The pairwise Hamming distances are as follows:
$\delta(\vec{r}_0, \vec{r}_6) = 1$, $\delta(\vec{r}_2, \vec{r}_5) = 1$,
$\delta(\vec{r}_0, \vec{r}_5) = 3$, $\delta(\vec{r}_0, \vec{r}_2) = 4$,
$\delta(\vec{r}_5, \vec{r}_6) = 4$ and $\delta(\vec{r}_2, \vec{r}_6) = 5$.
Consequently, to achieve better physical clustering, we should try to
store $r_0$ and $r_6$ together and $r_2$ and $r_5$ together, while
keeping $r_0$ and $r_6$ apart from $r_2$ and $r_5$.
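As a minimal sketch of how Hamming distance ranks candidate co-location partners, consider the Python snippet below. The three bit vectors are hypothetical (they are not the vectors of Fig. 2), and the names `hamming`, `r_a`, `r_b`, `r_c` are introduced here purely for illustration.

```python
from typing import Sequence


def hamming(u: Sequence[int], v: Sequence[int]) -> int:
    """Number of positions at which two equal-length bit vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


# Hypothetical record utilization vectors over eight queries.
r_a = [1, 1, 1, 0, 0, 0, 0, 1]
r_b = [1, 1, 1, 0, 0, 0, 0, 0]
r_c = [0, 0, 0, 1, 1, 0, 1, 0]

d_ab = hamming(r_a, r_b)  # small distance: good candidates for the same page
d_ac = hamming(r_a, r_c)  # large distance: better kept apart
```

Under these assumptions `r_a` and `r_b` differ on a single query, so placing them on the same page saves a page access for almost every query that touches either one.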


¹The Hamming distance between two record utilization vectors
is equal to their edit distance [46], as well as the Manhattan
distance [44] between these two vectors in $l_1$ norm.


Table 1: Symbols used throughout the manuscript 

4. OVERVIEW OF TUNABLE-LSH 

Although we are dealing with a clustering problem, the dynamic
nature of queries over RDF data necessitates a solution different
from existing ones. That is, while conventional clustering
algorithms [42] might be perfectly applicable for the offline tuning
of a database, in an online scenario, even the most efficient
algorithms may be impractical unless records are clustered on-the-fly
and within microseconds. Clustering is an NP-complete
problem [42], and most approximations take at least quadratic time. It is
not very well-understood which clustering algorithm is more
suitable for which types of input distributions, let alone the fact
that incremental versions of these algorithms are largely
domain-specific. Therefore, we develop Tunable-LSH, which is a
self-tuning locality-sensitive hash (LSH) function. As records are
fetched from the storage system, we keep track of records that
are fragmented. Then, we use Tunable-LSH to decide, in
constant-time, how a fragmented record needs to be clustered in
the storage system (cf. Fig. 1). Furthermore, we develop methods
to continuously auto-tune this LSH function to adapt to changing
query access patterns that we encounter while executing a
workload. This way, Tunable-LSH can achieve much higher clustering
accuracy than conventional LSH schemes, which are static.

Let $\mathbb{Z}_{\alpha \ldots \beta}$ denote the set of integers in the
interval $[\alpha, \beta]$, and let $\mathbb{Z}^{n}_{\alpha \ldots \beta}$
denote the n-fold Cartesian product:

$$\underbrace{\mathbb{Z}_{\alpha \ldots \beta} \times \cdots \times \mathbb{Z}_{\alpha \ldots \beta}}_{n}.$$

Let us assume that we are given a non-injective, surjective function
$f : \mathbb{Z}_{0 \ldots (k-1)} \to \mathbb{Z}_{0 \ldots (b-1)}$, where
$b \ll k$, and for all $y \in \mathbb{Z}_{0 \ldots (b-1)}$, it holds that

$$|\{x : f(x) = y\}| \le \lceil k/b \rceil.$$

In other words, $f$ is a hash function with the property that, given $k$
input values and $b$ possible outcomes, no more than $\lceil k/b \rceil$
values in the domain of the function will be hashed to the same value.
Then, we define Tunable-LSH as
$h : \mathbb{Z}^{k}_{0 \ldots 1} \to \mathbb{Z}_{0 \ldots (\epsilon-1)}$,
where $\epsilon$ represents the number of pages in the storage system.
More specifically, $h$ is defined as a composition of two functions
$h_1$ and $h_2$.

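Definition 1 (Tunable-LSH).

Let

$$\vec{r} = (r[0], \ldots, r[k-1]) \in \mathbb{Z}^{k}_{0 \ldots 1}, \text{ and}$$
$$\vec{c} = (c[0], \ldots, c[b-1]) \in \mathbb{Z}^{b}_{0 \ldots \lceil k/b \rceil}.$$

Then, a tunable LSH function $h$ is defined as

$$h = h_2 \circ h_1$$

where

$$h_1 : \mathbb{Z}^{k}_{0 \ldots 1} \to \mathbb{Z}^{b}_{0 \ldots \lceil k/b \rceil}, \text{ where } h_1(\vec{r}) = \vec{c} \text{ iff } \forall y:\; c[y] = \sum_{x=0}^{k-1} \begin{cases} r[x] & : f(x) = y \\ 0 & : f(x) \ne y \end{cases}$$

$$h_2 : \mathbb{Z}^{b}_{0 \ldots \lceil k/b \rceil} \to \mathbb{Z}_{0 \ldots (\epsilon-1)}, \text{ where } h_2(\vec{c}) = v \text{ iff}$$

$v$ is the coordinate of $\vec{c}$ (rounded to the nearest integer) on a
space-filling curve [49] of length $\epsilon$ that covers
$\mathbb{Z}^{b}_{0 \ldots \lceil k/b \rceil}$.


Algorithm 1 Initialize

Ensure:
    Record utilization counters are allocated and initialized
1: procedure Initialize()
2:     construct int C[ω][2b]      ▷ For simplicity, C is allocated
                                     statically; however, in practice,
                                     it can be allocated dynamically
                                     to reduce memory footprint.
3:     for all i ∈ {0, ..., ω − 1} do
4:         for all j ∈ {0, ..., 2b − 1} do
5:             C[i][j] ← 0
6:         end for
7:     end for
8: end procedure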


According to Def. 1, $h$ is constructed as follows:

1. Using a hash function $f$ (which can be treated as a black
box for the moment), a record utilization vector $\vec{r}$ with $k$
bits is divided into $b$ disjoint segments $\vec{r}_0, \ldots, \vec{r}_{b-1}$
such that $\vec{r}_0, \ldots, \vec{r}_{b-1}$ contain all the bits in
$\vec{r}$, and each $\vec{r}_i \in \{\vec{r}_0, \ldots, \vec{r}_{b-1}\}$
has at most $\lceil k/b \rceil$ bits. Then, a record utilization
counter $\vec{c}$ with $b$ entries is computed such that the
$i$-th entry of $\vec{c}$ (i.e., $c[i]$) contains the number of 1-bits
in $\vec{r}_i$. Without loss of generality, a record utilization
counter $\vec{c}$ can be represented as a $b$-dimensional point in the
coordinate system $\mathbb{Z}^{b}_{0 \ldots \lceil k/b \rceil}$.

2. The final hash value is computed by ordering the points in
$\mathbb{Z}^{b}_{0 \ldots \lceil k/b \rceil}$ using a space-filling
curve [49].
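The two construction steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation evaluated in the paper: the segmenting function `f` is a fixed contiguous split (the paper tunes `f` instead), the space-filling curve is taken to be a Z-order (Morton) curve because Algorithm 3 refers to a Z-value, and the linear scaling of the Z-value onto the page range in `h2` is an assumption made here. The names `h1`, `z_value`, `h2` and `tunable_lsh` are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def h1(r: Sequence[int], f: Callable[[int], int], b: int) -> List[int]:
    """Step 1: count the 1-bits of r that f assigns to each of the b segments."""
    c = [0] * b
    for x, bit in enumerate(r):
        c[f(x)] += bit
    return c


def z_value(point: Sequence[int], bits_per_dim: int) -> int:
    """Z-order (Morton) code: interleave the bits of all coordinates."""
    z = 0
    dims = len(point)
    for bit in range(bits_per_dim):
        for dim, coord in enumerate(point):
            z |= ((coord >> bit) & 1) << (bit * dims + dim)
    return z


def h2(c: Sequence[int], k: int, b: int, num_pages: int) -> int:
    """Step 2: position of c on the curve, scaled to a page id (assumed scaling)."""
    max_coord = math.ceil(k / b)            # largest possible counter value
    bits = max(1, max_coord.bit_length())   # bits needed per coordinate
    max_z = (1 << (bits * b)) - 1           # largest code on the padded grid
    return round(z_value(c, bits) * (num_pages - 1) / max_z)


def tunable_lsh(r, f, k, b, num_pages):
    """h = h2 o h1."""
    return h2(h1(r, f, b), k, b, num_pages)


k, b, num_pages = 8, 2, 16
segment = math.ceil(k / b)
f = lambda x: x // segment                  # assumption: contiguous segments

r = [1, 1, 0, 1, 0, 0, 1, 0]
counter = h1(r, f, b)
page = tunable_lsh(r, f, k, b, num_pages)
```

With `k = 8` and `b = 2`, each segment holds at most four bits, so every counter entry lies in the range 0 to 4 and three bits per coordinate suffice for the Morton code.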


In Section 5.1 we show that Tunable-LSH, which maps $k$-dimensional
record utilization vectors to natural numbers in the interval
$[0, \ldots, \epsilon - 1]$, is locality-sensitive, with two important
implications: (i) records with similar record utilization vectors
(i.e., small Hamming distances) are likely going to be hashed to the
same value, while (ii) records with dissimilar record utilization vectors
are likely going to be separated. Therefore, the problem of clustering
records in the storage system can be approximated using
Tunable-LSH, such that clustering $n$ records takes $O(n)$ time.

The quality of Tunable-LSH, that is, how well it approximates
the original Hamming distances, depends on two factors: (i) the
characteristics of the workload so far, which is reflected by the bit
distribution in the record utilization vectors, and (ii) the choice of
$f$. In Section 5.2 we demonstrate that $f$ can be tuned to adapt
to the changing patterns in record utilization vectors to maintain
the approximation quality of Tunable-LSH at a steady and high
level.


Algorithm 2 Tune

Require:
    q_t: query access vector produced at time t
Ensure:
    Underlying data structures are updated and f is tuned such that
    the LSH function maintains a steady approximation quality
 1: procedure Tune(q_t)
 2:     Reconfigure-F(q_t)
 3:     for all i ∈ Positional(q_t) do
 4:         loc ← f(t)
 5:         if loc < (shift % b) then
 6:             loc += b
 7:         end if
 8:         C[i][loc]++             ▷ Increment record utilization
                                      counters based on the new
                                      query access pattern
 9:         if … = 0 then           ▷ Reset "old" counters
10:             shift++
11:             C[i][(shift + b) % 2b] ← 0
12:         end if
13:     end for
14: end procedure



Algorithms 1–3 present our approach for computing the outcome
of Tunable-LSH and for incrementally tuning the LSH function
every time a query is executed. Note that we have two design
considerations: (i) tuning should take constant-time, otherwise, there
is no point in using a function, (ii) the memory overhead should
be low because it would be desirable to maximize the allocation
of memory to core database functionality. Consequently, instead
of relying on record utilization vectors, the algorithm computes
and incrementally maintains record utilization counters (cf.
Algorithm 1) that are much easier to maintain and that have a much
smaller memory footprint due to the fact that $b \ll k$. Then,
whenever there is a need to compute the outcome of the LSH function
for a given record, the Hash procedure is called with the id of the
record, which in turn relies on $h_2$ to compute the hash value (cf.
Algorithm 3).

The Tune procedure looks at the next query access vector, and
updates $f$ (line 2), which will be discussed in more detail in
Section 5.2. Then it computes positions of records that have been
accessed by that query (line 3), and increments the corresponding
entries in the utilization counters of those records that have been
accessed (line 8). To determine which entry to increment, the
algorithm relies on $h_1$, hence, $f(t)$ (cf. Def. 1) and a shifting scheme.
In line 11, old entries in record utilization counters are reset based
on an approach that we discuss in Section 5.3. In that section we
also discuss the shifting scheme.
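The core idea of maintaining counters incrementally, rather than storing full record utilization vectors, can be sketched as follows. This is a deliberately simplified illustration: it omits the reconfiguration of `f`, the shifting scheme and the resetting of old entries, and it allocates only `b` counter entries per record instead of `2b`. The class name `UtilizationCounters` and the choice `f(t) = t mod b` are assumptions made for this sketch.

```python
from typing import Callable, List, Sequence


class UtilizationCounters:
    """Simplified counters: C[i][f(t)] is incremented when query t accesses record i."""

    def __init__(self, num_records: int, b: int, f: Callable[[int], int]):
        self.f = f
        self.C: List[List[int]] = [[0] * b for _ in range(num_records)]

    def tune(self, t: int, query_vector: Sequence[int]) -> None:
        """Fold the query access vector produced at time t into the counters."""
        loc = self.f(t)  # counter entry that absorbs the bit of query t
        for i, bit in enumerate(query_vector):
            if bit:
                self.C[i][loc] += 1


b = 2
counters = UtilizationCounters(num_records=3, b=b, f=lambda t: t % b)

# Hypothetical workload of four queries over three records.
workload = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
for t, q in enumerate(workload):
    counters.tune(t, q)
```

Each call touches only the records accessed by the current query, and the resulting counters equal what applying $h_1$ to the full record utilization vectors would produce, without ever materializing those vectors.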


5. DETAILS OF TUNABLE-LSH AND OPTIMIZATIONS
Algorithm 3 Hash

Require:
    id: id of record whose hash is being computed
Ensure:
    Hash value is returned
1: procedure Hash(id)
2:     return Z-Value(C[id])       ▷ Apply h_2
3: end procedure


This section is structured as follows: Section 5.1 shows that
Tunable-LSH has the properties of a locality-sensitive hashing
scheme. Section 5.2 describes our approach for tuning $f$ based on
the most recent query access patterns, and Section 5.3 explains how
old bits are removed from record utilization counters.

5.1 Properties of Tunable-LSH 

In this part, we discuss the locality-sensitive properties of $h_1$
and $h_2$, and demonstrate that $h_2 \circ h_1$ can be used for clustering
the records. First, we show the relationship between record
utilization vectors and the record utilization counters that are obtained
by applying $h_1$.

Theorem 1 (Distance Bounds). Given a pair of record
utilization vectors $\vec{r}_1$ and $\vec{r}_2$ with size $k$, let $\vec{c}_1$
and $\vec{c}_2$ denote two record utilization counters with size $b$ such
that $\vec{c}_1 = h_1(\vec{r}_1)$ and $\vec{c}_2 = h_1(\vec{r}_2)$ (cf. Def. 1).
Furthermore, let $c_1[i]$ and $c_2[i]$ denote the $i$-th entry in
$\vec{c}_1$ and $\vec{c}_2$, respectively. Then,

$$\delta(\vec{r}_1, \vec{r}_2) \ge \sum_{i=0}^{b-1} |c_1[i] - c_2[i]| \qquad (1)$$

where $\delta(\vec{r}_1, \vec{r}_2)$ represents the Hamming distance between
$\vec{r}_1$ and $\vec{r}_2$.
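The lower bound of Thm. 1 can be spot-checked numerically. The Python sketch below draws random pairs of bit vectors, computes their counters with a segmenting function, and verifies that the Hamming distance is never smaller than the Manhattan distance between the counters. The choice `f(x) = x mod b`, the parameters and the function names are assumptions made for this check; a passing run is a sanity check, not a proof.

```python
import random
from typing import Callable, List, Sequence


def hamming(u: Sequence[int], v: Sequence[int]) -> int:
    """Number of differing positions between two bit vectors."""
    return sum(a != b for a, b in zip(u, v))


def h1(r: Sequence[int], f: Callable[[int], int], b: int) -> List[int]:
    """Counters: number of 1-bits of r that f maps to each of the b entries."""
    c = [0] * b
    for x, bit in enumerate(r):
        c[f(x)] += bit
    return c


def manhattan(c1: Sequence[int], c2: Sequence[int]) -> int:
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(c1, c2))


def check_lower_bound(k: int, b: int, trials: int, seed: int = 0) -> bool:
    """Return True if the bound held on every randomly drawn pair."""
    rng = random.Random(seed)
    f = lambda x: x % b  # assumption: any f with balanced buckets would do
    for _ in range(trials):
        r1 = [rng.randint(0, 1) for _ in range(k)]
        r2 = [rng.randint(0, 1) for _ in range(k)]
        if hamming(r1, r2) < manhattan(h1(r1, f, b), h1(r2, f, b)):
            return False
    return True


bound_holds = check_lower_bound(k=24, b=6, trials=500)
```

Intuitively, bits that differ in opposite directions inside the same segment cancel out in the counters, which is why the counter distance can only underestimate the Hamming distance.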


Proof Sketch 1. We prove Thm. 1 by induction on $b$.

Base case: Thm. 1 holds when $b = 1$. According to Def. 1, when
$b = 1$, $c_1[0]$ and $c_2[0]$ correspond to the total number of 1-bits in
$\vec{r}_1$ and $\vec{r}_2$, respectively. Note that the Hamming distance
between $\vec{r}_1$ and $\vec{r}_2$ will be smallest if and only if these two
record utilization vectors are aligned on as many 1-bits as possible. In
that case, they will differ in only $|c_1[0] - c_2[0]|$ bits, which
corresponds to their Hamming distance. Consequently, Eqn. 1 holds for
$b = 1$.

Inductive step: We show that if Eqn. 1 holds for $b \le a$, where $a$
is a natural number greater than or equal to 1, then it must also
hold for $b = a + 1$. Let $\Pi_f(\vec{r}, g)$ denote a record utilization
vector $\vec{r}' = (r'[0], \ldots, r'[k-1])$ such that for all
$i \in \{0, \ldots, k-1\}$, $r'[i] = r[i]$ holds if $f(i) = g$, and
$r'[i] = 0$ otherwise. Then,

$$\delta(\vec{r}_1, \vec{r}_2) = \sum_{g=0}^{b-1} \delta\big(\Pi_f(\vec{r}_1, g), \Pi_f(\vec{r}_2, g)\big). \qquad (2)$$

That is, the Hamming distance between any two record utilization
vectors is the summation of their individual Hamming distances
within each group of bits that share the same hash value
with respect to $f$. This property holds because $f$ is a (total)
function, and $\Pi_f$ masks all the irrelevant bits. As an abbreviation, let
$\delta_g = \delta(\Pi_f(\vec{r}_1, g), \Pi_f(\vec{r}_2, g))$. Then, due to the
same reasoning as in the base case, for $g = a$, the following equation
holds:

$$\delta_a(\vec{r}_1, \vec{r}_2) \ge |c_1[a] - c_2[a]| \qquad (3)$$

Consequently, using the additive property in Eqn. 2, it can be shown
that Eqn. 1 holds also for $b = a + 1$. Thus, by induction, Thm. 1
holds. ■

Thm. 1 suggests that the Hamming distance between any two
record utilization vectors $\vec{r}_1$ and $\vec{r}_2$ can be approximated
using record utilization counters $\vec{c}_1 = h_1(\vec{r}_1)$ and
$\vec{c}_2 = h_1(\vec{r}_2)$ because Eqn. 1 provides a lower bound on
$\delta(\vec{r}_1, \vec{r}_2)$. In fact, the right-hand side of Eqn. 1 is equal
to the Manhattan distance [44] between $\vec{c}_1$ and $\vec{c}_2$ in
$\mathbb{Z}^{b}_{0 \ldots \lceil k/b \rceil}$, and since
$\delta(\vec{r}_1, \vec{r}_2)$ is equal to the Manhattan distance between
$\vec{r}_1$ and $\vec{r}_2$ in $\mathbb{Z}^{k}_{0 \ldots 1}$, it is easy to see
that $h_1$ is a transformation that approximates Manhattan distances.
The following corollary captures this property.

Corollary 2 (Distance Approximation). Given a pair
of record utilization vectors $\vec{r}_1$ and $\vec{r}_2$ with size $k$, let
$\vec{c}_1$ and $\vec{c}_2$ denote two points in the coordinate system
$\mathbb{Z}^{b}_{0 \ldots \lceil k/b \rceil}$ such that
$\vec{c}_1 = h_1(\vec{r}_1)$ and $\vec{c}_2 = h_1(\vec{r}_2)$ (cf. Def. 1). Let
$\delta^M(\vec{r}_1, \vec{r}_2)$ denote the Manhattan distance between
$\vec{r}_1$ and $\vec{r}_2$, and let $\delta^M(\vec{c}_1, \vec{c}_2)$ denote
the Manhattan distance between $\vec{c}_1$ and $\vec{c}_2$. Then, the
following holds:

$$\delta(\vec{r}_1, \vec{r}_2) = \delta^M(\vec{r}_1, \vec{r}_2) \ge \delta^M(\vec{c}_1, \vec{c}_2) \qquad (4)$$

Proof Sketch 2. Hamming distance in $\mathbb{Z}^{k}_{0 \ldots 1}$ is a
special case of Manhattan distance. Furthermore, by definition [44], the
right-hand side of Eqn. 1 equals the Manhattan distance
$\delta^M(\vec{c}_1, \vec{c}_2)$; therefore, Eqn. 4 holds. ■

Next, we demonstrate that $h_1$ is a locality-sensitive
transformation [30, 40]. In particular, we use the definition of
locality-sensitiveness by Tao et al. [56], and show that the probability that
two record utilization vectors $\vec{r}_1$ and $\vec{r}_2$ are transformed into
"nearby" record utilization counters $\vec{c}_1$ and $\vec{c}_2$ increases as
the (Manhattan) distance between $\vec{r}_1$ and $\vec{r}_2$ decreases.

Theorem 3 (Good Approximation). Given a pair of record
utilization vectors $\vec{r}_1$ and $\vec{r}_2$ with size $k$, let $\vec{c}_1$
and $\vec{c}_2$ denote two points in the coordinate system
$\mathbb{Z}^{b}_{0 \ldots \lceil k/b \rceil}$ such that
$\vec{c}_1 = h_1(\vec{r}_1)$, $\vec{c}_2 = h_1(\vec{r}_2)$ and $b = 1$
(cf. Def. 1). Let $\delta^M(\vec{r}_1, \vec{r}_2)$ denote the Manhattan
distance between $\vec{r}_1$ and $\vec{r}_2$, and let
$\delta^M(\vec{c}_1, \vec{c}_2)$ denote the Manhattan distance between
$\vec{c}_1$ and $\vec{c}_2$. Furthermore, let $\Pr_{\delta^M \le \theta}(x)$ be
a shorthand for

$$\Pr\big(\delta^M(\vec{c}_1, \vec{c}_2) \le \theta \mid \delta^M(\vec{r}_1, \vec{r}_2) = x\big).$$

Then,

$$\Pr_{\delta^M \le \theta}(x) = \frac{\sum_{a = \lceil (x-\theta)/2 \rceil}^{\lfloor (x+\theta)/2 \rfloor} \binom{x}{a}}{2^{x}} \qquad (5)$$

where $\theta, x \in \mathbb{Z}_{0 \ldots k}$ and $\theta \le x$.

Proof Sketch 3. If the Hamming/Manhattan distance between
$\vec{r}_1$ and $\vec{r}_2$ is $x$, then it means that these two vectors will
differ in exactly $x$ bits, as shown below.

    r_1 :  □□□  1 1 1 ... 1  0 ... 0 0 0  □□□
    r_2 :  □□□  0 0 0 ... 0  1 ... 1 1 1  □□□

Furthermore, if $\vec{r}_1$ has $A + a$ bits set to 1, then $\vec{r}_2$ must
have $A + (x - a)$ bits set to 1, where $A$ denotes the number of matching
1-bits between $\vec{r}_1$ and $\vec{r}_2$ and $a \in \{0, \ldots, x\}$. Note
that when $b = 1$, the Manhattan distance between $\vec{c}_1$ and $\vec{c}_2$
is equal to the difference in the number of 1-bits that $\vec{r}_1$ and
$\vec{r}_2$ have. Hence,

$$\delta^M(\vec{c}_1, \vec{c}_2) = |A + x - a - (A + a)| = |x - 2a|.$$

It is easy to see that there are $(x + 1)$ different configurations:

    a = 0      ⇒  δ^M(c_1, c_2) = x
    a = 1      ⇒  δ^M(c_1, c_2) = x − 2
    ...
    a = x − 1  ⇒  δ^M(c_1, c_2) = x − 2
    a = x      ⇒  δ^M(c_1, c_2) = x.

Only when $a \in \{\lceil (x-\theta)/2 \rceil, \ldots, \lfloor (x+\theta)/2 \rfloor\}$
will $\delta^M(\vec{c}_1, \vec{c}_2) \le \theta$ be satisfied. For each
satisfying value of $a$, the non-matching bits in $\vec{r}_1$ and $\vec{r}_2$
can be combined in $\binom{x}{a}$ possible ways. Therefore, there
are a total of

$$\sum_{a = \lceil (x-\theta)/2 \rceil}^{\lfloor (x+\theta)/2 \rfloor} \binom{x}{a}$$

combinations such that $\delta^M(\vec{c}_1, \vec{c}_2) \le \theta$. Since there
are $2^{x}$ possible combinations in total, the posterior probability in
Thm. 3 holds. ■

Figure 3: $\Pr_{\delta^M \le \theta}$ for $k = 24$ and $b = 6$


Using Thm. 3, it is possible to show that when $b = 1$, for all
$\theta \le x$ where $\theta, x \in \mathbb{Z}_{0 \ldots (k-2)}$, the following
holds:

$$\Pr_{\delta^M \le \theta}(x) \ge \Pr_{\delta^M \le \theta}(x + 2). \qquad (6)$$

Therefore, $h_1$ is locality-sensitive for $b = 1$. Due to space
limitations, we omit the proof of Eqn. 6, but in a nutshell, the proof
follows from the fact that going from $x$ to $x + 2$, the denominator
in Eqn. 5 always increases by a factor of 4, whereas the numerator
increases by a factor that is strictly less than 4.
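The closed form of Thm. 3 and the monotonicity property can be checked by exhaustive enumeration for small $x$. The sketch below assumes the reading used in the proof sketch: for $b = 1$ the counter distance equals $|x - 2a|$, where $a$ is the number of differing positions at which the first vector holds the 1-bit, and the event of interest is $|x - 2a| \le \theta$. The function names are hypothetical, and exact fractions are used to avoid floating-point comparisons.

```python
from fractions import Fraction
from itertools import product
from math import comb


def pr_closed_form(x: int, theta: int) -> Fraction:
    """Sum of C(x, a) over ceil((x-theta)/2) <= a <= floor((x+theta)/2), over 2^x."""
    lo = (x - theta + 1) // 2   # integer ceiling of (x - theta) / 2 for theta <= x
    hi = (x + theta) // 2       # integer floor of (x + theta) / 2
    return Fraction(sum(comb(x, a) for a in range(lo, hi + 1)), 2 ** x)


def pr_brute_force(x: int, theta: int) -> Fraction:
    """Enumerate all 2^x ways to orient the x differing bits and count the hits."""
    hits = 0
    for pattern in product((0, 1), repeat=x):
        a = sum(pattern)        # differing bits where the first vector has the 1
        if abs(x - 2 * a) <= theta:
            hits += 1
    return Fraction(hits, 2 ** x)


# Compare both computations and check monotonicity in steps of two.
all_match = all(
    pr_closed_form(x, theta) == pr_brute_force(x, theta)
    for x in range(0, 11)
    for theta in range(0, x + 1)
)
monotone = all(
    pr_closed_form(x, theta) >= pr_closed_form(x + 2, theta)
    for x in range(0, 11)
    for theta in range(0, x + 1)
)
```

The comparison is made in steps of two because $|x - 2a|$ always has the same parity as $x$; for example, with $\theta = 0$ the probability is zero for every odd $x$.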

Generalizing Thm. 3 and Eqn. 6 to cases where $b \ge 2$ is more
complicated. However, our empirical analyses across multiple
values of $k$ and $b$ demonstrate that

$$\Pr_{\delta^M \le \theta}(x) \ge \Pr_{\delta^M \le \theta}(y)$$

holds when $y \ge x$. For example, Fig. 3 shows
$\Pr_{\delta^M \le \theta}(x)$ when $k = 24$ and $b = 6$. Fig. 3 and our
empirical evaluations verify that $h_1$ is locality sensitive. Thus,
combined with a space-filling curve, it can be used to approximate the
clustering problem.


5.2 Achieving and Maintaining Tighter Bounds on Tunable-LSH

Next, we demonstrate how it is possible to reduce the
approximation error of $h_1$. We first define load factor of an entry of a
record utilization counter.


Definition 2 (Load Factor). Given a record utilization
counter $\vec{c} = (c[0], \ldots, c[b-1])$ with size $b$, the load factor of
the $i$-th entry is $c[i]$.


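Theorem 4 (Effects of Grouping). Given two record
utilization vectors $\vec{r}_1$ and $\vec{r}_2$ with size $k$, let $\vec{c}_1$
and $\vec{c}_2$ denote two record utilization counters with size $b = 1$
such that $\vec{c}_1 = h_1(\vec{r}_1)$ and $\vec{c}_2 = h_1(\vec{r}_2)$. Then,

$$\Pr\big(\delta^M(\vec{c}_1, \vec{c}_2) = \delta(\vec{r}_1, \vec{r}_2) \mid c_1[0] = l_1 \text{ AND } c_2[0] = l_2\big) = \gamma \qquad (7)$$

where

$$\gamma = \frac{\binom{l_{max}}{l_{min}} \binom{k}{l_{max}}}{\binom{k}{l_{max}} \binom{k}{l_{min}}} \qquad (8)$$

and

$$l_{max} = \max(l_1, l_2)$$
$$l_{min} = \min(l_1, l_2).$$

Proof Sketch 4. Let $\vec{r}_{max}$ denote the record utilization
vector with the most number of 1-bits among $\vec{r}_1$ and $\vec{r}_2$, and
let $\vec{r}_{min}$ denote the vector with the least number of 1-bits. When
$b = 1$, $\delta^M(\vec{c}_1, \vec{c}_2) = \delta(\vec{r}_1, \vec{r}_2)$ holds
if and only if the number of 1-bits on which $\vec{r}_1$ and $\vec{r}_2$ are
aligned is $l_{min}$ because in that case, both
$\delta^M(\vec{c}_1, \vec{c}_2)$ and $\delta(\vec{r}_1, \vec{r}_2)$ are equal
to $l_{max} - l_{min}$ (note that $\delta^M(\vec{c}_1, \vec{c}_2)$ is always
equal to $l_{max} - l_{min}$). Assuming that the positions of 1-bits
in $\vec{r}_{max}$ are fixed, there are $\binom{l_{max}}{l_{min}}$ possible
ways of arranging the 1-bits of $\vec{r}_{min}$ such that
$\delta(\vec{r}_1, \vec{r}_2) = l_{max} - l_{min}$. Since the 1-bits of
$\vec{r}_{max}$ can be arranged in $\binom{k}{l_{max}}$ different ways, there
are $\binom{l_{max}}{l_{min}} \binom{k}{l_{max}}$ combinations such that
$\delta^M(\vec{c}_1, \vec{c}_2) = \delta(\vec{r}_1, \vec{r}_2)$. Note that in
total, the bits of $\vec{r}_1$ and $\vec{r}_2$ can be arranged in
$\binom{k}{l_{max}} \binom{k}{l_{min}}$ possible ways; therefore, Eqns. 7
and 8 describe the posterior probability that
$\delta^M(\vec{c}_1, \vec{c}_2) = \delta(\vec{r}_1, \vec{r}_2)$, given
$c_1[0] = l_1$ and $c_2[0] = l_2$. ■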
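The counting argument in Proof Sketch 4 can be verified by brute force for small $k$. The sketch below enumerates every placement of the 1-bits of both vectors and compares the fraction of placements in which the counter distance equals the Hamming distance against the ratio from the proof, which simplifies to $\binom{l_{max}}{l_{min}} / \binom{k}{l_{min}}$ after cancelling $\binom{k}{l_{max}}$. The function names and the choice $k = 6$ are assumptions for this illustration.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def pr_exact_brute_force(k: int, l1: int, l2: int) -> Fraction:
    """Fraction of placements where |l1 - l2| equals the Hamming distance."""
    hits = 0
    total = 0
    for ones1 in combinations(range(k), l1):
        s1 = set(ones1)
        for ones2 in combinations(range(k), l2):
            total += 1
            # The symmetric difference holds exactly the differing positions.
            if len(s1 ^ set(ones2)) == abs(l1 - l2):
                hits += 1
    return Fraction(hits, total)


def pr_exact_closed_form(k: int, l1: int, l2: int) -> Fraction:
    """Simplified ratio from the counting argument: C(l_max, l_min) / C(k, l_min)."""
    l_max, l_min = max(l1, l2), min(l1, l2)
    return Fraction(comb(l_max, l_min), comb(k, l_min))


k = 6
agreement = all(
    pr_exact_brute_force(k, l1, l2) == pr_exact_closed_form(k, l1, l2)
    for l1 in range(k + 1)
    for l2 in range(k + 1)
)

at_extreme = pr_exact_closed_form(k, 0, 3)  # one load factor is zero
in_middle = pr_exact_closed_form(k, 3, 3)   # both load factors are mid-range
```

In this small setting the counter distance is exact whenever one load factor is 0 or $k$, while mid-range load factors make an exact match much less likely, which matches the qualitative claim about load factors.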

According to Eqns. 7 and 8 in Thm. 4, the probability that
$\delta^*(c_1, c_2)$ is only an approximation of $\delta(r_1, r_2)$, that is, not
exactly equal to it, is lower for load factors that are close or
equal to zero and likewise for load factors that are close or equal
to their maximum possible value (cf. the accompanying figure). This
property suggests that by carefully choosing f, it is possible to
achieve even tighter error bounds for $h_1$.
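
The counting argument behind Thm. 4 can be checked numerically. The following Python sketch is only an illustration under the b = 1 setting of the theorem: the function names are ours (hypothetical), the closed form is the simplified ratio of binomials from the proof sketch, and the brute-force routine enumerates every placement of 1-bits for small k.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def exact_match_probability(k, l1, l2):
    """Closed form from the counting argument: C(lmax, lmin) / C(k, lmin)."""
    lmax, lmin = max(l1, l2), min(l1, l2)
    return Fraction(comb(lmax, lmin), comb(k, lmin))

def brute_force_probability(k, l1, l2):
    """Enumerate all placements of 1-bits and count the exact matches."""
    hits = total = 0
    for ones1 in combinations(range(k), l1):
        s1 = set(ones1)
        for ones2 in combinations(range(k), l2):
            hamming = len(s1 ^ set(ones2))   # positions where the vectors differ
            counter_distance = abs(l1 - l2)  # distance between counters when b = 1
            hits += (hamming == counter_distance)
            total += 1
    return Fraction(hits, total)

if __name__ == "__main__":
    for l1, l2 in [(0, 3), (2, 3), (3, 3), (6, 2)]:
        print(l1, l2, exact_match_probability(6, l1, l2))
```

With k = 6, load factors of 0 or 6 give probability 1, whereas mid-range load factors such as (2, 3) give a much lower value, which matches the observation that extreme load factors reduce the chance of error.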
Symbol   Description

begin    natural number between 0 ... (k - 1), initial value is 0
size     natural number between 0 ... (k - 1), keeps track of the
         number of query access vectors that are currently being
         maintained, initial value is 0
H        matrix that contains Min-Hash values for each query
         access vector
S[]      array of vector(s), one for each MDS query point, that
         pairs each MDS query point with a random subset of points
N[]      array of max-heap(s), one for each MDS query point, that
         pairs each MDS query point with a set of neighbouring points
X[]      array of float(s), represents the coordinate (single
         dimensional) of each MDS query point
V[]      array of float(s), represents the current (directional)
         velocity of each MDS query point


Table 2: Data structures referenced in algorithms


Contrast the matrices in Fig. 2b and Fig. 2c, which contain the
same query access vectors, but the columns are grouped in two
different ways: (i) in Fig. 2b the grouping is based on the original
sequence of execution, and (ii) in Fig. 2c queries with similar
access patterns are grouped together. Fig. 2d and Fig. 2e represent
the corresponding record utilization counters for the record
utilization vectors in the matrices in Fig. 2b and Fig. 2c,
respectively. Take two records whose utilization vectors differ on
every bit, for instance. Their actual Hamming distance with respect
to q0-q7 is 8. Now consider the transformed matrices. According to
Fig. 2d, the Hamming distance lower bound is 0, whereas according
to Fig. 2e, it is 8. Clearly, the bounds in the second representation
are closer to the original. The reason is as follows. Even though
the two records differ on all the bits for q0-q7, when the bits are
grouped as in Fig. 2b, the counts alone cannot distinguish the two
bit vectors. In contrast, if the counts are computed based on the
grouping in Fig. 2c (which clearly places the 1-bits in separate
groups), the counts indicate that the two bit vectors are indeed
different.
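
The effect of the grouping can be reproduced with a few lines of Python. This is a minimal sketch with made-up vectors (not the ones in Fig. 2), and it assumes that the lower bound derived from the counters is the sum of absolute differences between corresponding counter entries, which agrees with the b = 1 case in the proof sketch of Thm. 4.

```python
def counters(vector, groups):
    """Count the 1-bits of `vector` that fall into each group of positions."""
    return [sum(vector[i] for i in group) for group in groups]

def hamming(u, v):
    """Number of positions on which two bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def manhattan(c, d):
    """Sum of absolute differences between two counters."""
    return sum(abs(a - b) for a, b in zip(c, d))

# Two hypothetical record utilization vectors over 8 queries that
# differ on every bit.
r_a = [1, 0, 1, 0, 1, 0, 1, 0]
r_b = [0, 1, 0, 1, 0, 1, 0, 1]

# Grouping by execution order: each group mixes 1-bits of both records.
by_order = [[0, 1, 2, 3], [4, 5, 6, 7]]
# Grouping by access pattern: each group holds 1-bits of one record only.
by_similarity = [[0, 2, 4, 6], [1, 3, 5, 7]]

if __name__ == "__main__":
    print(hamming(r_a, r_b))                                               # 8
    print(manhattan(counters(r_a, by_order), counters(r_b, by_order)))     # 0
    print(manhattan(counters(r_a, by_similarity),
                    counters(r_b, by_similarity)))                          # 8
```

The first grouping yields identical counters for both records, so the bound collapses to 0; the second grouping recovers the full distance of 8.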

The observations above are in accordance with Thm. 4.
Consequently, we make the following optimization. Instead of randomly
choosing a hash function, we construct f such that it maps queries
with similar access vectors (i.e., columns in the matrix) to the same
hash value. This way, it is possible to obtain record utilization
counters with entries that have either very high or very low load
factors (cf. Def. 2), thus decreasing the probability of error (cf.
Thm. 4).

We develop a technique to efficiently determine groups of queries
with similar access patterns and to adaptively maintain these groups
as the access patterns change. Our approach consists of two parts:
(i) to approximate the similarity between any two queries, we rely
on the Min-Hash scheme [18], and (ii) to adaptively group similar
queries, we develop an incremental version of a multidimensional
scaling (MDS) algorithm [48].

Min-Hash offers a quick and efficient way of approximating
the similarity (more specifically, the Jaccard similarity) between
two sets of integers. Therefore, to use it, the query access
vectors in our conceptualization need to be translated into a set of
positional identifiers that correspond to the records for which the
bits in the vector are set to 1. For example, a query access vector
in Fig. 2 for which r0, r5 and r6 are the only records whose bits are
set to 1 should be represented with the set {0, 5, 6}. Note that we
do not need to store the original query access vectors at all. In fact,
after the access patterns over a query are determined, we compute
and store only its Min-Hash value. This is important for keeping
the memory overhead of our algorithm low.
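
To make the translation and the Min-Hash estimate concrete, here is a small Python sketch. It is not the implementation used in the paper: the linear hash family, the seed, and all function names are assumptions chosen for illustration, and the input sets are assumed to be non-empty.

```python
import random

def positions(bits):
    """Translate a query access vector into the set of record positions set to 1."""
    return {i for i, bit in enumerate(bits) if bit}

def make_hash_functions(count, seed=0, prime=2_147_483_647):
    """Build `count` random linear hash functions x -> (a*x + b) mod prime."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(count)]
    return [lambda x, a=a, b=b: (a * x + b) % prime for a, b in params]

def min_hash(items, hash_functions):
    """Signature: the minimum hash value of the set under each hash function."""
    return [min(h(x) for x in items) for h in hash_functions]

def estimated_jaccard(sig1, sig2):
    """Fraction of signature entries on which two signatures agree."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

def jaccard(s1, s2):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(s1 & s2) / len(s1 | s2)

if __name__ == "__main__":
    hs = make_hash_functions(64)
    q_a = positions([1, 0, 0, 0, 0, 1, 1, 0])   # {0, 5, 6}
    q_b = positions([1, 0, 0, 0, 0, 1, 0, 1])   # {0, 5, 7}
    print(jaccard(q_a, q_b), estimated_jaccard(min_hash(q_a, hs), min_hash(q_b, hs)))
```

Calling `make_hash_functions(1)` yields a signature with a single entry per query, which corresponds to storing one Min-Hash value per query access vector.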


Algorithm 4 Reconfigure-F

Require:

q_t: query access vector produced at time t

Ensure:

Coordinates of MDS points are updated, which are used in determining
the outcome of f

 1: procedure RECONFIGURE-F(q_t)
 2:   pos ← (begin + size) % k
 3:   S[pos].clear()
 4:   N[pos].clear()
 5:   X[pos] ← -0.5 + rand() / RAND-MAX
 6:   V[pos] ← 0
 7:   H[pos] ← MIN-HASH(q_t)
 8:   if size < k then
 9:     size += 1
10:   else
11:     begin ← (begin + 1) % k
12:   end if
13:   for i ← 0, i < size, i++ do
14:     x ← (begin + i) % k
15:     UPDATE-S-AND-N(x)
16:     UPDATE-VELOCITY(x)
17:   end for
18:   for i ← 0, i < size, i++ do
19:     x ← (begin + i) % k
20:     UPDATE-COORDINATES(x)
21:   end for
22: end procedure


Queries with similar access patterns are grouped together using
a multidimensional scaling (MDS) algorithm [45] that was originally
developed for data visualization, and has recently been used
for clustering [16]. Given a set of points and a distance function,
MDS assigns coordinates to points such that their original distances
are preserved as much as possible. In one efficient implementation
[48], each point is initially assigned a random set of coordinates,
but these coordinates are adjusted iteratively based on a
spring-force analogy. That is, it is assumed that points exert a force
on each other that is proportional to the difference between their
actual and observed distances, where the latter refers to the
distance that is computed from the algorithm-assigned coordinates.
These forces are used for computing the current velocity (V in
Table 2) and the approximated coordinates of a point (X in
Table 2). The intuition is that, after successive iterations, the system
will reach equilibrium, at which point the approximated coordinates
can be reported. Since computing all pairwise distances can
be prohibitively expensive, the algorithm relies on a combination
of sampling (S[] in Table 2) and maintaining, for each point, a list
of its nearest neighbours (N[] in Table 2); only these distances are
used in computing the net force acting on a point. Then, the nearest
neighbours are updated in each iteration by removing the most
distant neighbour of a point and replacing it with a new point from the
random sample if the distance between the point and the random
sample is smaller than the distance between the point and its most
distant neighbour.
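
The spring-force analogy can be illustrated with a deliberately simplified Python sketch. It is not the algorithm of [48]: it places points on a single dimension, uses all pairwise distances instead of the sampled and nearest-neighbour lists, and the parameter names `step` and `damping` (and their default values) are assumptions made for this example.

```python
def stress(coords, dist):
    """Sum of squared differences between target and embedded distances."""
    n = len(coords)
    return sum((abs(coords[i] - coords[j]) - dist[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def spring_mds_1d(dist, initial, iterations=200, step=0.05, damping=0.5):
    """Iteratively adjust 1-D coordinates so that embedded distances
    approach the target distances in `dist`."""
    x = list(initial)
    n = len(x)
    v = [0.0] * n
    for _ in range(iterations):
        force = [0.0] * n
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                delta = x[i] - x[j]
                # Direction pointing away from j; ties are broken by index so
                # that the forces on i and j stay equal and opposite.
                away = 1.0 if (delta > 0 or (delta == 0 and i > j)) else -1.0
                # Positive when the points are closer than they should be.
                force[i] += (dist[i][j] - abs(delta)) * away
        for i in range(n):
            # Damped velocity update; `step` should shrink as n grows.
            v[i] = damping * v[i] + step * force[i]
            x[i] += v[i]
    return x

if __name__ == "__main__":
    targets = [0.0, 1.0, 2.0, 5.0]
    dist = [[abs(a - b) for b in targets] for a in targets]
    coords = spring_mds_1d(dist, initial=[0.3, 0.2, -0.1, -0.2])
    print(coords, stress(coords, dist))
```

Because only distances are preserved, the result may be a shifted or mirrored copy of the original positions; for grouping purposes only the relative order and spacing matter.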

This algorithm cannot be used directly for our purposes because
it is not incremental. Therefore, we propose a revised MDS
algorithm that incorporates the following modifications:

1. In our case, each point in the algorithm represents a query
access vector. However, since we are not interested in
visualizing these points, but rather clustering them, we configure
the algorithm to place these points along a single dimension.
Then, by dividing the coordinate space into consecutive
regions, we are able to determine similar query access vectors.


^Groups are separated by vertical dashed lines. 

^In practice, this translation never takes place because the system 
maintains positional vectors to begin with. 


2. Instead of computing the coordinates of all of the points at
once, our version makes incremental adjustments to the
coordinates every time reconfiguration is needed.
Algorithm 5 Hash Function f

Require:

t: sequence number of a query access vector

Ensure:

f(t) is computed and returned

1: procedure F(t)
2:   pos ← t % k
3:   (lo, hi) ← GROUP-BOUNDS(X[pos])
4:   coid ← CENTROID(lo, hi)
5:   return HASH(coid) % b
6: end procedure



               0   1   2   3   4   5

t = 0          □   □   □   ⊘
t = ⌈k/3⌉          □   □   □   ⊘
t = ⌈2k/3⌉             □   □   □   ⊘
t = k          ⊘           □   □   □
t = ⌈4k/3⌉     □   ⊘           □   □
t = ⌈5k/3⌉     □   □   ⊘           □


Figure 5: Assuming b = 3, □ indicates the allowed locations at
each time tick, and ⊘ indicates the counter to be reset.


The revised algorithm is given in Algorithm 4. First, the
algorithm decides which MDS point to assign to the new query
access vector q_t (line 2). It clears the array and the heap data
structures containing, respectively, (i) the randomly sampled, and
(ii) the neighbouring set of points (lines 3-4). Furthermore, it assigns
a random coordinate to the point within the interval [-0.5, 0.5]
(line 5), and resets its velocity to 0 (line 6). Next, it computes
the Min-Hash value of q_t and stores it in H[pos] (line 7). Then,
it makes two passes over all the points in the system (lines 13-21),
while first updating their sample and neighbouring lists (line 15),
computing the net forces acting on them based on the Min-Hash
distances and updating their velocities (line 16); and then updating
their coordinates (line 20).
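
The bookkeeping in lines 2 and 8-12 of Algorithm 4 amounts to a ring buffer over k MDS points. A minimal Python sketch of just that part follows; the class and method names are hypothetical, and the MDS updates themselves are omitted.

```python
class SlotRing:
    """Tracks which of the k MDS points is reused for the next query access vector."""

    def __init__(self, k):
        self.k = k
        self.begin = 0   # position of the oldest point that is still maintained
        self.size = 0    # number of points currently maintained

    def next_slot(self):
        """Return the position to (re)assign, mirroring lines 2 and 8-12."""
        pos = (self.begin + self.size) % self.k
        if self.size < self.k:
            self.size += 1
        else:
            # The ring is full: the oldest point is overwritten and the
            # second-oldest becomes the new beginning.
            self.begin = (self.begin + 1) % self.k
        return pos

    def live_slots(self):
        """Positions visited by the two passes (lines 13-21), oldest first."""
        return [(self.begin + i) % self.k for i in range(self.size)]

if __name__ == "__main__":
    ring = SlotRing(3)
    print([ring.next_slot() for _ in range(5)])  # [0, 1, 2, 0, 1]
    print(ring.live_slots())                     # [2, 0, 1]
```

Once k vectors have been seen, each new query access vector takes over the position of the oldest one, which is how the last k queries are kept.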

The procedures used in the last part are implemented in a similar
way as the original algorithm [48]; that is, in line 15 the sampled
points are updated, in line 16 the velocities assigned to the
MDS points are updated, and in line 20 the coordinates of the MDS
points are updated based on these updated velocities. However, our
implementation of the Update-Velocity procedure (line 16) is
slightly different from the original. In particular, in updating the
velocities, we use a decay function so that the algorithm forgets
"old" forces that might have originated from the elements in S[]
and N[] that have been assigned to new query access vectors in the
meantime. Note that unless one keeps track of the history of all the
forces that have acted on every point in the system, there is no other
way of "undoing" or "forgetting" these "old" forces.

Given the sequence number of a query access vector (t), the
outcome of the hash function f is determined based on the coordinates
of the MDS point that had previously been assigned to the query
access vector by the Reconfigure procedure. To this end, the
coordinate space is divided into b groups containing points with
consecutive coordinates such that there are at most ⌈k/b⌉ points in
each group. Then, one option is to use the group identifier, which
is a number in {0, ..., b - 1}, as the outcome of f, but there is a
problem with this naive implementation. Specifically, we observed
that even though the relative coordinates of MDS points within the
"same" group may not change significantly across successive calls
to the Reconfigure procedure, points within a group, as a whole, may
shift. This is an inherent (and in fact, a desirable) property of the
incremental algorithm. However, the problem is that there may be
far too many cases where the group identifier of a point changes just
because the absolute coordinates of the group have changed, even
though the point continues to be part of the "same" group. To solve
this problem, we rely on a method of computing the centroid within
a group by taking the Min-Hash of the identifiers of points within
that group such that these centroids rarely change across successive
iterations. Then, we rely on the identifier of the centroid, as
opposed to its coordinates, to compute the group number, hence,
the outcome of f. The pseudocode of this procedure is given in
Algorithm 5.
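
The idea of deriving the hash value from a stable group representative, rather than from absolute coordinates, can be sketched in Python as follows. The helper names, the fixed linear hash over point identifiers, and the way groups are cut from the sorted order are all assumptions for illustration; the GROUP-BOUNDS, CENTROID and HASH procedures of Algorithm 5 are not specified in this section.

```python
from math import ceil

def group_of(pos, coords, b):
    """Identifiers of the points that share a group with point `pos`.

    Points are sorted by coordinate and cut into consecutive groups of at
    most ceil(k / b) points each.
    """
    k = len(coords)
    capacity = ceil(k / b)
    order = sorted(range(k), key=lambda p: (coords[p], p))
    rank = order.index(pos)
    start = (rank // capacity) * capacity
    return order[start:start + capacity]

def point_hash(p, prime=2_147_483_647):
    """Hypothetical fixed hash over point identifiers."""
    return (1_103_515_245 * p + 12_345) % prime

def f(pos, coords, b):
    """Hash value of a point, derived from its group's representative."""
    members = group_of(pos, coords, b)
    # Min-Hash style representative: the member with the smallest hash value.
    centroid = min(members, key=point_hash)
    return point_hash(centroid) % b

if __name__ == "__main__":
    coords = [0.10, 0.90, 0.15, 0.80]
    shifted = [c + 10.0 for c in coords]
    print([f(p, coords, 2) for p in range(4)])
    print([f(p, shifted, 2) for p in range(4)])  # same values after the shift
```

Since the representative depends only on which identifiers are in the group, a uniform shift of all coordinates leaves every hash value unchanged.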

We make one last observation. Internally, Min-Hash uses multiple
hash functions to approximate the degree to which two sets
are similar [18]. It is also known that increasing the number of
internal hash functions used (within Min-Hash) should increase
the overall accuracy of the Min-Hash scheme. However, as
unintuitive as it may seem, in our approach, we use only a single
hash function within Min-Hash, yet we are still able to achieve
sufficiently high accuracy. The reason is as follows. Recall that
Algorithm 4 relies on multiple pairwise distances to position every
point. Consequently, even though individual pairwise distances
may be inaccurate (because we are using just a single hash function
within Min-Hash), collectively the errors cancel out, and
points can be positioned accurately on the MDS coordinate space.

5.3 Resetting Old Entries in Record Utilization
Counters

Once the group identifier is computed (cf. Algorithm 5), it should
be straightforward to update the record utilization counters (cf.
the Tune procedure). However, unless we maintain the original
query access vectors, which is prohibitively expensive, we have no
way of knowing which counters to decrement when a query access
vector becomes stale. Therefore, we develop a more efficient scheme
in which old values can also be removed from the record utilization
counters.

Instead of maintaining b entries in every record utilization counter,
we maintain twice as many entries (2b). Then, whenever the Tune
procedure is called, instead of directly using the outcome of f(t)
to locate the counters to be incremented, we map f(t) to a location
within an "allowed" region of consecutive entries in the record
utilization counter. At every ⌈k/b⌉-th iteration, this allowed region
is shifted by one to the right, wrapping back to the beginning if
necessary. Consider Fig. 5. Assuming that b = 3 and that at time
t = 0 the allowed region spans entries from 0 to (b - 1), at time
t = ⌈k/b⌉, the region will span entries from 1 to b; at time t = k,
the region will span entries from b to 2b - 1; and at time
t = k + ⌈k/b⌉, the region will span entries 0 and those from
b + 1 to 2b - 1.

Since f(t) produces a value between 0 and b - 1 (inclusive),
whereas the entries are numbered from 0 to 2b - 1 (inclusive), the
Reconfigure procedure uses f(t) as follows. If the outcome of f(t)
directly corresponds to a location in the allowed region, then it is
used. Otherwise, the output is incremented by b. Whenever the
allowed region is shifted to the right, it may land on an already
incremented entry. If that is the case, that entry is reset, thereby
allowing "old" values to be forgotten. These resets are marked in
Fig. 5. This scheme guarantees that any query access pattern that is
less than k steps old is remembered, while any query access pattern
that is more than 2k steps old is forgotten.
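
The sliding allowed region can be sketched in Python for a single record utilization counter. This is an illustration under stated assumptions, not the paper's implementation: the class and method names are ours, the region is shifted right after every ⌈k/b⌉-th recorded query, and the entry that newly enters the region is the one that is reset.

```python
from math import ceil

class SlidingCounter:
    """Record utilization counter with 2b entries and a sliding allowed region."""

    def __init__(self, b, k):
        self.b, self.k = b, k
        self.entries = [0] * (2 * b)
        self.start = 0                 # first entry of the allowed region
        self.period = ceil(k / b)      # shift once every ceil(k/b) queries
        self.t = 0                     # number of queries recorded so far

    def allowed(self):
        """Entries currently inside the allowed region (with wrap-around)."""
        return [(self.start + i) % (2 * self.b) for i in range(self.b)]

    def locate(self, f_value):
        """Map f(t) in [0, b-1] to its entry inside the allowed region."""
        return f_value if f_value in self.allowed() else f_value + self.b

    def record(self, f_value):
        """Count one query with hash value f_value, then advance the clock."""
        self.entries[self.locate(f_value)] += 1
        self.t += 1
        if self.t % self.period == 0:
            self.start = (self.start + 1) % (2 * self.b)
            entering = (self.start + self.b - 1) % (2 * self.b)
            self.entries[entering] = 0   # forget what was counted there before

if __name__ == "__main__":
    c = SlidingCounter(b=3, k=6)
    for value in [0, 0, 1, 1, 1, 1, 2, 2]:
        c.record(value)
    print(c.allowed(), c.entries)   # [4, 5, 0] [0, 2, 0, 0, 2, 2]
```

Any window of b consecutive entries (modulo 2b) contains exactly one of v and v + b for each hash value v, so `locate` always finds a unique entry. In the usage example, the two counts stored in entry 0 at the start are cleared once the region wraps around to it.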


6. EXPERIMENTAL EVALUATION 
                     19.4   18.8   44.0   24.5   17.0    93.0
WatDiv 100M   40.4   42.0   71.4  210.3   96.4   62.7   767.2


Table 3: Query execution time, geometric mean
(milliseconds)
(b) WatDiv 100M triples


Figure 6: Comparison of chameleon-db implemented using a
hierarchical clustering algorithm and with Tunable-LSH


In this section, we evaluate Tunable-LSH in three sets of
experiments. First, we evaluate it within chameleon-db, our
prototype RDF data management system. Second, we evaluate it
within a hashtable implementation since hashtables are used
extensively in RDF data management systems. Finally, we evaluate
Tunable-LSH in isolation, to understand how it behaves under
different types of workloads. All experiments are performed on a
commodity machine with an AMD Phenom II x4 955 3.20 GHz
processor, 16 GB of main memory and a hard disk drive with 100 GB
of free disk space. The operating system is Ubuntu 12.04 LTS.

6.1 Tunable-LSH in chameleon-db 

The first experiment evaluates Tunable-LSH within chameleon-db,
which is our prototype RDF data management system [10]. In
particular, in earlier work, we had introduced a hierarchical
clustering algorithm for grouping RDF triples into what we call
group-by-query clusters [11]. In this evaluation, we replace that
hierarchical clustering algorithm with Tunable-LSH, and study its
implications on the end-to-end query performance, keeping the same
experimental configuration. For completeness, we quote our
description of the experimental setup from our previous paper:

“For our evaluations, we [primarily] use the Waterloo SPARQL
Diversity Test Suite (WatDiv) because it facilitates the generation
of test cases that are far more diverse than any of the existing
benchmarks. In this regard, we use the WatDiv data generator to
create two datasets: one with 10 million RDF triples and another
with 100 million RDF triples (we observe that systems under test
(SUT) load data into main memory on the smaller dataset whereas
at 100M triples, SUTs perform disk I/O). Then, using the WatDiv
query template generator, we create 125 query templates and
instantiate each query template with 100 queries, thus, obtaining
12500 queries” [11].

We compare our approach with chameleon-db implemented with
the hierarchical clustering algorithm (abbreviated CDB [ICDE'15])
and “five popular systems, namely, RDF-3x [51], MonetDB [37],
4Store and Virtuoso Open Source (VOS) versions 6.1 [27] and
7.1 [26]. RDF-3x follows the single-table approach and creates
multiple indexes; MonetDB is a column-store, where RDF data are
represented using vertical partitioning [1]; and the last three
systems are industrial systems. Both 4Store and VOS group and index
data primarily based on RDF predicates, but VOS 6.1 is a row-store
whereas VOS 7.1 is a column-store. We configure these systems so
that they make as much use of the available main memory as
possible.” [11]

^http://db.uwaterloo.ca/watdiv/stress-workloads.tar.gz

“We evaluate each system independently on each query template.
Specifically, for each query template, we first warm up the system
by executing the workload for that query template once (i.e., 100
queries). Then, we execute the same workload five more times (i.e.,
500 queries). We report average query execution time over the last
five workloads.” [11]

“Our prototype starts with a completely segmented clustering,
where each cluster consists of a single triple.” [11] However, “after
the execution of the 100th query, we allow the storage advisor to
compute a better group-by-query clustering” [11] using either the
hierarchical clustering algorithm in [11] or Tunable-LSH.

Our experiments indicate that on average, the time to compute
the group-by-query clusters has decreased by an order of magnitude
with the introduction of Tunable-LSH. For example, for
the 100M triples dataset, it took 317.6 milliseconds on (geometric)
average to compute the group-by-query clusters using the
hierarchical clustering algorithm in [11], whereas with Tunable-LSH,
it takes only about 26.1 milliseconds. This is due to the
approximate nature of Tunable-LSH. As shown in Table 3 and in Fig. 6,
this approximation has a slight impact on query performance, but
for the 100M triples dataset, CDB is still significantly faster than
the other RDF data management systems. There is one apparent
reason for the slowdown: Tunable-LSH is an approximate method, and
therefore, the generated group-by-query clusters are not perfect. To
verify this hypothesis, we studied the logs generated during our
experiments, which revealed the following: using the group-by-query
clustering in [11], chameleon-db's query engine was able to execute
64.8% of the queries without any decomposition (a property that
chameleon-db's query optimizer is trying to achieve), whereas
the group-by-query clustering computed using Tunable-LSH
resulted in only 27.1% of the queries being executed without
decomposition. Of course, it is possible to improve chameleon-db's
query optimizer, but that is a topic for future research.

This trade-off between the clustering overhead and the query
execution time suggests that for RDF workloads that are too dynamic
to be predicted and sampled upfront, it might be desirable to have
frequent clustering steps, in which case, using Tunable-LSH is a
much better option because of its lower overhead.

6.2 Self-Clustering Hashtable 

The second experiment evaluates an in-memory hashtable that
we developed that uses Tunable-LSH to dynamically cluster
records in the hashtable. Hashtables are commonly used in RDF data
management systems. For example, the dictionary in an RDF data
management system, which maps integer identifiers to URIs or
literals (and vice versa), is often implemented as a hashtable
[1, 26, 58]. Secondary indexes can also be implemented as hashtables,
whereby the hashtable acts as a key-value store and maps tuple
identifiers to the content of the tuples. In fact, in our own prototype
RDF system, chameleon-db, all the indexes are secondary (dense)
indexes because instead of relying on any sort order inherent in
the data, we rely on the notion of group-by-query clusters, in which
RDF triples are ordered purely based on the workload.

The hashtable interface is very similar to that of a standard
hashtable, except that users are given the option to mark the
beginning and end of queries. This information is used to dynamically
cluster records such that those that are co-accessed across similar
sets of queries also become physically co-located. All of the
clustering and re-clustering is transparent to the user; hence, we
call this the self-clustering hashtable.


(a) Random Access (All Data Structures) - Control how loaded the
workloads are

(b) Random Access - Control how loaded the workloads are

(c) Random Access - Control record count, keep record size
constant at 128 bytes

(d) Random Access - Control workload dynamism

(e) Sensitivity analysis of Tunable-LSH - Control workload
dynamism

(f) Sensitivity analysis of Tunable-LSH - Control Tunable-LSH
parameter 2b


Figure 7: Experimental evaluation of Tunable-LSH in a self-clustering hashtable and the sensitivity analysis of Tunable-LSH

The self-clustering hashtable has the following advantages and
disadvantages. Compared to a standard hashtable that tries to avoid
hash-collisions, it deliberately co-locates records that are accessed
together. If the workloads favour a scenario in which many records
are frequently accessed together, then we can expect the
self-clustering hashtable to have improved fetch times due to better
CPU cache utilization, prefetching, etc. On the other hand, these
optimizations come with three types of overhead. First, every time
a query is executed, Tunable-LSH needs to be updated (cf. the
Tune and Reconfigure-F procedures). Second, compared to a standard
hashtable in which the physical address of a record is determined
solely using the underlying hash function (which is deterministic
throughout the entire workload), in our case, the physical address
of a record needs to be maintained dynamically because the
underlying hash function is not deterministic (i.e., it is also
changing dynamically throughout the workload). Consequently, there
is the overhead of going to a lookup table and retrieving the
physical address of a record. Third, physically moving records
around in the storage system takes time; in fact, this is often an
expensive operation. Therefore, the objective of this set of
experiments is twofold: (i) to evaluate the circumstances under which
the self-clustering hashtable outperforms other popular data
structures, and (ii) to understand when the tuning overhead may
become a bottleneck. Consequently, we report the end-to-end query
execution times, and if necessary, break them down into the time to
(i) fetch the records, and (ii) tune the data structures (which
includes all types of overhead listed above).

In our experiments, we compare the self-clustering hashtable to
popular implementations of three data structures. Specifically, we
use: (i) std::unordered_map, which is the C++ standard library
implementation of a hashtable, (ii) std::map, which is the C++
standard library implementation of a red-black tree, and (iii)
stx::btree, which is an open source in-memory B+ tree
implementation. As a baseline, we also include a static version of
our hashtable, i.e., one that does not rely on Tunable-LSH.

^http://www.cplusplus.com/reference/unordered_map/unordered_map/

We consider two types of workloads: one in which records are
accessed sequentially and the other in which records are accessed
randomly. Each workload consists of 3000 queries that are
synthetically generated using WatDiv. For each data structure, we
measure the end-to-end workload execution time and compute the
mean query execution time by dividing the total workload execution
time by the number of queries in the workload.

Queries in these workloads consist of changing query access
patterns, and in different experiments, we control different parameters
such as the number of records that are accessed by queries on
average, the rate at which the query access patterns change in the
workload, etc. We repeat each experiment 20 times over workloads that
are randomly generated with the same characteristics (e.g., average
number of records accessed by each query, how fast the workload
changes, etc.) and report averages across these 20 runs. We do not
report standard errors because they are negligibly small and they
do not add significant value to our results.

For the sequential case, stx::btree and std::map outperform
the hashtables, which is expected because once the first few records
are fetched from main memory, the remaining ones can already be
prefetched into the CPU cache (due to the predictability of the
sequential access pattern). Therefore, for the remaining part, we
focus on the random access scenario, which is more common in RDF
data management systems, and which can be a bottleneck even in
systems like RDF-3x [51] that have clustered indexes over all
permutations of attributes. For more examples and a thorough
explanation, we refer the reader to [11].

In this experiment, we control the number of records that a query
needs to access (on average), where each record is 128 bytes. Fig. 7a
compares all the data structures with respect to their end-to-end
(mean) query execution times. Three observations stand out. First,
in the random access case, the self-clustering hashtable as well as
the standard hashtable perform much better than the other data
structures, which is what would be expected. This observation
also holds for the subsequent experiments; therefore, for
presentation purposes, we do not include the other data structures
in Fig. 7b-7d. Second, the baseline static version of our hashtable
(i.e., without Tunable-LSH) performs much worse than the standard
hashtable, even worse than a B+ tree. This suggests that our
implementation can be optimized further, which might improve the
performance of the self-clustering hashtable as well (this is left
as future work). Third, as the number of records that a query needs
to access increases, the self-clustering hashtable outperforms all
the other data structures, which verifies our initial hypothesis.

^http://www.cplusplus.com/reference/map/map/
^https://panthema.net/2007/stx-btree/

For the same experiment, Fig. 7b focuses on the self-clustering
hashtable versus the standard hashtable, and illustrates
why the performance improvement is higher (for the self-clustering
hashtable) for workloads in which queries access more records.
Note that while the fetch time of the self-clustering hashtable scales
proportionally with respect to std::unordered_map, the tune
overhead is proportionally much lower for workloads in which queries
access more records. This is because with increasing "records per
query count", records can be re-located in batches across the pages
in main memory as opposed to moving individual records around.

Next, we keep the average number of records that a query needs
to access constant at 2000, but control the number of records in the
database. As in the previous experiment, each record is 128 bytes.
As illustrated in Fig. 7c, increasing the number of records in the
database (i.e., scaling-up) favours the self-clustering hashtable. The
reason is that, when there are only a few records in the database, the
records are likely clustered to begin with. We repeat the same
experiment, but this time, by controlling the record size and
keeping the database size constant at 640 megabytes. Surprisingly,
the relative improvement with respect to the standard hashtable
remains more or less constant, which indicates that the improvement
is largely dominated by the size of the database, and increasing it
is to the advantage of the self-clustering hashtable.

Finally, we evaluate how sensitive the self-clustering hashtable is
to the dynamism in the workloads. Note that for the self-clustering
hashtable to be useful at all, the workloads need to be predictable,
at least to a certain extent. That is, if records are physically
clustered but are never accessed in the future, then all those
clustering efforts are wasted. To verify this hypothesis, we control
the expected number of query clusters (i.e., queries with similar but
not exactly the same access vectors) in any 100 consecutive queries in
the workloads that we generate. Let us call this property of the
workload its 100-Uniqueness. Fig. 7d illustrates how the tuning
overhead starts to become a bottleneck as the workloads become
more and more dynamic, to the extent of being completely unique,
i.e., each query accesses a distinct set of records.

6.3 Sensitivity Analysis of Tunable-LSH 

In the final set of experiments, we evaluate the sensitivity of
Tunable-LSH in isolation, that is, without worrying about how it
affects physical clustering, and compare it to three other hash
functions: (i) a standard non-locality sensitive hash function,
(ii) bit-sampling, which is known to be locality-sensitive for Hamming
distances [40], and (iii) Tunable-LSH without the optimizations
discussed in Section 5.2. These comparisons are made across
workloads with different characteristics (i.e., dense vs. sparse,
dynamic vs. stable, etc.) where parameters such as the average number
of records accessed per query and the expected number of query
clusters within any 100-consecutive sequence of queries in the
workload are controlled.

^http://en.cppreference.com/w/cpp/utility/hash

Our evaluations indicate that Tunable-LSH generally
outperforms its alternatives. Due to space considerations, we cannot
present all of our results in detail. Therefore, we will summarize
our most important observations.

Fig. 7e shows how the probability that the evaluated hash
functions place records with similar utilization vectors to nearby hash
values changes as the workloads become more and more dynamic.
In computing these probabilities, both the original distances (i.e.,
$\delta$) and the distances over the hashed values (i.e., $\delta^*$) are
normalized with respect to the maximum distance in each geometry. As
illustrated in Fig. 7e, Tunable-LSH achieves higher probability
even when the workloads are dynamic. The unoptimized version of
Tunable-LSH behaves more or less like a static locality-sensitive
hash function, such as bit sampling, which is an expected result
because Tunable-LSH cannot achieve high accuracy without the
workload-sensitive arrangement introduced in Section 5.2. It is also
important to emphasize that even in that case Tunable-LSH is no
worse than a standard LSH scheme, which is aligned with the
theorems in Section 5.1. We have not included the results on the
standard non-locality sensitive hash function, because, as one might
guess, it has a probability distribution that is completely
unrelated to our clustering objectives.

Fig. 1^ demonstrates how the choice of b (or 2b as described
in Section 5.3) affects the accuracy of Tunable-LSH. A
higher b implies fewer undesirable collisions of query access
vectors and, hence, a higher accuracy. For bit sampling, on the
other hand, the ideal number of samples is equal to the number of
query clusters in the workload; thus, increasing b, which corresponds
to the number of bits that are sampled, might result in oversampling
and, therefore, lower accuracy. For example, consider two record
utilization vectors 1001 and 0001 with Hamming distance 1. If only
1 bit is sampled, there is a 3/4 probability that these two vectors
will be hashed to the same value. On the other hand, if 2 bits are
sampled, the probability drops to 1/2.
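
The bit-sampling example above can be checked by enumeration. The
sketch below is illustrative only: it assumes that the sampled bit
positions are distinct and that every subset of positions is equally
likely, which the text does not state explicitly. The function name is
hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def collision_probability(u: str, v: str, num_sampled: int) -> Fraction:
    """Probability that bit sampling hashes u and v to the same value.

    Assumption: positions are sampled without replacement, and every
    subset of num_sampled positions is equally likely.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    subsets = list(combinations(range(len(u)), num_sampled))
    # The two vectors collide when they agree on every sampled position.
    agreeing = sum(all(u[i] == v[i] for i in s) for s in subsets)
    return Fraction(agreeing, len(subsets))


p_one_bit = collision_probability("1001", "0001", 1)
p_two_bits = collision_probability("1001", "0001", 2)
print(p_one_bit, p_two_bits)  # 3/4 1/2
```

Under these assumptions, sampling more bits lowers the chance that two
nearly identical vectors collide, which is the oversampling effect
described above.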

7. CONCLUSIONS AND FUTURE WORK 

In this paper, we introduce Tunable-LSH, a locality-sensitive
hashing scheme, and demonstrate its use in clustering
records in an RDF data management system. In particular, we keep
track of the fragmented records in the database and use
Tunable-LSH to decide, in constant time, where a record needs to be
placed in the storage system. Tunable-LSH takes into account
the most recent query access patterns over the database and uses
this information to auto-tune such that records that are accessed
across similar sets of queries are hashed, as much as possible, to
the same or nearby pages in the storage system. This property
distinguishes Tunable-LSH from existing locality-sensitive hash
functions, which are static. Our experiments with (i) a version of our
prototype RDF data management system, chameleon-db, that uses
Tunable-LSH, (ii) a hashtable that relies on Tunable-LSH to
dynamically cluster its records, and (iii) workloads that rigorously
test the sensitivity of Tunable-LSH verify the potential benefits
of Tunable-LSH.

As future work, it would be beneficial to answer the following
questions. First, the assumption that the last k queries are
representative of the future queries in the workload can be relaxed. As
outlined in j^, the issue of deciding “when and based on what
information to tune the physical design” of our system still remains
an open problem. Second, as our experiments indicate, query
optimization in chameleon-db has significant room for improvement.
We need techniques that can handle more approximate group-by-query
clusters such as those generated by Tunable-LSH. Third,
we believe that Tunable-LSH can be used in a more general setting
than just RDF systems. In fact, it should be possible to extend
the idea of the self-clustering in-memory hashtable that we have
implemented to a more general, distributed key-value store.